feat: implement DataWriter for Iceberg data files #552

Open
shangxinli wants to merge 1 commit into apache:main from shangxinli:implement-data-file-writer
@shangxinli
Contributor

Implements DataWriter class for writing Iceberg data files as part of issue #441 (task 2).

Implementation:

  • Factory method DataWriter::Make() for creating writer instances
  • Support for Parquet and Avro file formats via WriterFactoryRegistry
  • Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
  • Proper lifecycle management with Initialize/Write/Close/Metadata
  • PIMPL idiom for ABI stability

Related to #441

@shangxinli force-pushed the implement-data-file-writer branch from 8944a75 to a201953 on January 31, 2026 17:59

ICEBERG_ASSIGN_OR_RAISE(writer_,
WriterFactoryRegistry::Open(options_.format, writer_options));
return {};
Contributor

It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the constructor?

Contributor Author

Refactored the initialization logic

Comment on lines 62 to 58
if (closed_) {
return InvalidArgument("Writer already closed");
}
Contributor

I can see a case for making Close() idempotent. Is there a strong reason to return this error rather than treating a repeated call as a no-op?

Contributor Author

Fixed
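
For readers following the thread, the idempotent behaviour discussed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Status`, `IdempotentWriter`, and `FlushAndRelease()` are hypothetical stand-ins rather than iceberg-cpp types.

```cpp
#include <string>

// Hypothetical stand-in for a status type; not the iceberg-cpp Status.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical writer used only to illustrate an idempotent Close().
class IdempotentWriter {
 public:
  // The first successful call does the work; later calls succeed as no-ops.
  Status Close() {
    if (closed_) {
      return Status{true, ""};  // already closed: nothing to do
    }
    Status st = FlushAndRelease();
    if (!st.ok) {
      return st;  // closed_ stays false, so the failure remains visible
    }
    closed_ = true;
    return Status{true, ""};
  }

  bool closed() const { return closed_; }
  int flush_count() const { return flush_count_; }

 private:
  // Placeholder for closing the underlying file writer.
  Status FlushAndRelease() {
    ++flush_count_;
    return Status{true, ""};
  }

  bool closed_ = false;
  int flush_count_ = 0;
};
```

With this shape, calling `Close()` twice returns success both times while the underlying flush runs only once.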

return InvalidArgument("Writer already closed");
}
ICEBERG_RETURN_UNEXPECTED(writer_->Close());
closed_ = true;
Contributor

Should this class address thread safety?

Contributor Author

Good question! I've added explicit documentation that this class is not thread-safe:

Member

I don't think a single writer (or reader) should support thread safety, so there is no need to add a comment like this.

Comment on lines 78 to 109
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}
Contributor

nit: The two tests are quite similar; a shared helper function could probably reduce the duplication.

Contributor Author

Consolidated the two tests using parameterized testing.

// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);
Contributor

nit: check the size of the data passed to the write function?

Contributor Author

@shangxinli, Feb 7, 2026

Sure.

Comment on lines 45 to 47
if (!writer_) {
return InvalidArgument("Writer not initialized");
}
Collaborator

Suggested change
- if (!writer_) {
-   return InvalidArgument("Writer not initialized");
- }
+ ICEBERG_PRECHECK(writer_, "Writer not initialized");

nit: this should make the code shorter.

Contributor Author

Replaced all manual null checks with ICEBERG_PRECHECK

}

Result<FileWriter::WriteResult> Metadata() {
if (!closed_) {
Collaborator

nit: use ICEBERG_CHECK here

Contributor Author

SG

EXPECT_GT(length.value(), 0);
}

} // namespace
Collaborator

nit: move this closing namespace curly before the first TEST_F?

Contributor Author

done

@shangxinli force-pushed the implement-data-file-writer branch 2 times, most recently from 90d324e to 153d763 on February 7, 2026 01:31

Implements DataWriter class for writing Iceberg data files as part of
issue apache#441 (task 2).

Implementation:
- Static factory method DataWriter::Make() for creating writer instances
- Support for Parquet and Avro file formats via WriterFactoryRegistry
- Complete DataFile metadata generation including partition info,
  column statistics, serialized bounds, and sort order ID
- Proper lifecycle management with Write/Close/Metadata methods
- Idempotent Close() - multiple calls succeed (no-op after first)
- PIMPL idiom for ABI stability
- Not thread-safe (documented)

Tests:
- 13 comprehensive unit tests including parameterized format tests
- Coverage: creation, write/close lifecycle, metadata generation,
  error handling, feature validation, and data size verification
- All tests passing (13/13)

Related to apache#441

@shangxinli force-pushed the implement-data-file-writer branch from 153d763 to 147f25b on February 7, 2026 01:34

class DataWriter::Impl {
public:
static Result<std::unique_ptr<Impl>> Make(DataWriterOptions options) {
WriterOptions writer_options;
Member

nit: use aggregate initialization for writer_options

}

Status Write(ArrowArray* data) {
ICEBERG_PRECHECK(writer_, "Writer not initialized");
Member

Will this check ever fail? If not, should we remove the check or use ICEBERG_DCHECK instead? Same question for below.

}

Result<FileWriter::WriteResult> Metadata() {
ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
Member

Suggested change
- ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
+ ICEBERG_CHECK(closed_, "Cannot get metadata before closing the writer");

We should return invalid state instead of invalid argument in this case.
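
The distinction between the two error kinds can be illustrated with a small sketch. It assumes hypothetical `ErrorKind`, `Status`, and `StateCheckedWriter` types that are not part of iceberg-cpp, and it does not reproduce how `ICEBERG_PRECHECK` or `ICEBERG_CHECK` are defined; it only shows the idea that a bad call argument and a call made in the wrong lifecycle state are reported differently.

```cpp
#include <string>

// Hypothetical error categories, for illustration only.
enum class ErrorKind { kOk, kInvalidArgument, kInvalidState };

struct Status {
  ErrorKind kind;
  std::string message;
};

// Hypothetical writer; not the DataWriter from this PR.
class StateCheckedWriter {
 public:
  Status Write(const int* rows, int count) {
    // The caller passed something wrong: an argument error.
    if (rows == nullptr || count < 0) {
      return Status{ErrorKind::kInvalidArgument, "rows must be non-null"};
    }
    // The arguments are fine but the object is in the wrong state.
    if (closed_) {
      return Status{ErrorKind::kInvalidState, "Writer already closed"};
    }
    rows_written_ += count;
    return Status{ErrorKind::kOk, ""};
  }

  void Close() { closed_ = true; }

  Status Metadata() const {
    // No argument is involved here, so a failure can only be a state error.
    if (!closed_) {
      return Status{ErrorKind::kInvalidState,
                    "Cannot get metadata before closing the writer"};
    }
    return Status{ErrorKind::kOk, ""};
  }

 private:
  bool closed_ = false;
  long rows_written_ = 0;
};
```

In this sketch `Metadata()` takes no arguments, so reporting an argument error for a call made before `Close()` would mislead the caller about what to fix.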

data_file->file_path = options_.path;
data_file->file_format = options_.format;
data_file->partition = options_.partition;
data_file->record_count = metrics.row_count.value_or(0);
Member

Suggested change
- data_file->record_count = metrics.row_count.value_or(0);
+ data_file->record_count = metrics.row_count.value_or(-1);

The Java implementation uses -1 when the row count is unavailable.
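
A minimal sketch of why the fallback value matters, assuming the row count is held in a `std::optional<std::int64_t>` (the helper name `RecordCountOrUnknown` is hypothetical): with a fallback of 0, an unknown count would be indistinguishable from a file that really contains zero rows.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: -1 marks "row count unavailable", which stays
// distinguishable from a real count of 0.
inline std::int64_t RecordCountOrUnknown(
    const std::optional<std::int64_t>& row_count) {
  return row_count.value_or(-1);
}
```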

auto split_offsets = writer_->split_offsets();

auto data_file = std::make_shared<DataFile>();
data_file->content = DataFile::Content::kData;
Member

nit: use aggregate initialization


// Convert metrics maps from unordered_map to map
for (const auto& [col_id, size] : metrics.column_sizes) {
data_file->column_sizes[col_id] = size;
Member

Do you think it makes sense to change the DataFile and Metrics classes to use either std::map or std::unordered_map consistently, so that we don't need a for-loop here?

cc @zhjwpku
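
If the two container types were to stay different, the element-by-element loop could also be replaced by the `std::map` range constructor. The sketch below assumes `int32_t` keys and `int64_t` values purely for illustration; the actual field types in DataFile and Metrics may differ, and `ToOrderedMap` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>

// Hypothetical helper: copies an unordered_map into an ordered map using
// the iterator-range constructor instead of a hand-written loop.
inline std::map<std::int32_t, std::int64_t> ToOrderedMap(
    const std::unordered_map<std::int32_t, std::int64_t>& src) {
  return std::map<std::int32_t, std::int64_t>(src.begin(), src.end());
}
```

Using one container type on both sides, as suggested above, would remove the need for any conversion.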

Comment on lines +56 to +57
SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
Member

Suggested change
- SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
- SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
+ SchemaField::MakeRequired(1, "id", int32()),
+ SchemaField::MakeOptional(2, "name", string())});


using ::testing::HasSubstr;

class DataWriterTest : public ::testing::Test {
Member

Can we consolidate the test cases? Each of them tests only a tiny API and repeats the boilerplate of creating a writer and writing data. This may lead to an explosion of test cases if more and more are added in this style.

4 participants